arXiv:1509.05188v1 [q-bio.GN] 17 Sep 2015


The Bulk and The Tail of Minimal Absent Words 
in Genome Sequences 

Erik Aurell *, Nicolas Innocenti *, and Hai-Jun Zhou ‡

* Department of Computational Biology, KTH Royal Institute of Technology, AlbaNova University Center, SE-10691 Stockholm, Sweden,
† Department of Information and Computer Science, Aalto University, FI-02150 Espoo, Finland, and
‡ State Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China


Minimal absent words (MAWs) of a genomic sequence are subsequences that are
themselves absent but whose subwords are all present in the sequence. The
characteristic distribution of genomic MAWs as a function of their length has been
observed to be qualitatively similar for all living organisms, the bulk being rather
short, and only relatively few being long. It has been an open issue whether the
reason behind this phenomenon is statistical or reflects a biological mechanism, and
what biological information is contained in absent words. In this work we demonstrate
that the bulk can be described by a probabilistic model of sampling words from random
sequences, while the tail of long MAWs is of biological origin. We introduce the novel
concept of the core of a minimal absent word, a sequence that is present in the genome
and closest to a given MAW. We show that in bacteria and yeast the cores of the longest
MAWs, which exist in two or more copies, are located in highly conserved regions, the
most prominent example being ribosomal RNAs (rRNAs). We also show that while the
distribution of the cores of long MAWs is roughly uniform over these genomes on a
coarse-grained level, on a more detailed level it is strongly enhanced in 3' untranslated
regions (UTRs) and, to a lesser extent, also in 5' UTRs. This indicates that MAWs and
associated MAW cores correspond to fine-tuned evolutionary relationships, and suggests
that they can be more widely used as markers for genomic complexity.

Minimal absent word; copy-mutation evolution model; random sequence 

Abbreviations: AW, absent word; MAW, minimal absent word; rRNA, ribosomal RNA; UTR, untranslated region

Genomic sequences are texts in languages shaped by evolution. The simplest
statistical properties of these languages are short-range dependencies, ranging from
single-nucleotide frequencies (GC content) to $k$-step Markov models, both of which
are central to gene prediction and many other bioinformatic tasks [1]. More complex
characteristics, such as abundances of $k$-mers (sub-sequences of length $k$), have
applications to classification of genomic sequences [?] and, e.g., to fast
computations of species abundances in metagenomic data [?].

The counterpart of the words present in a text are the absent words (AWs),
subsequences which cannot be found in the text. In genomics the concept was first
introduced around 15 years ago for fragment assembly [10, 11] and for species
identification [?], and later developed for inter- and intra-species comparisons [?]
as well as for phylogeny construction [?]. A practical application is the design of
molecular bar codes, such as in the tagRNA-seq protocol recently introduced by us to
distinguish primary and processed transcripts in bacteria [?]. Short sequences or tags
are ligated to transcript 5' ends, and reads from processed and primary transcripts
can be distinguished in silico after sequencing based on the tags. For this to be
possible it is crucial that the tags do not match any subsequence of the genome under
study, i.e. that they correspond to absent words. In a further recent study we also
showed that the same method allows true antisense transcripts to be separated from
sequencing artifacts, giving a high-fidelity high-throughput antisense transcript
discovery protocol [?]. In these as in other biotechnological applications there is an
interest in finding short absent words, preferably with some additional tolerance.

Minimal absent words (MAWs) are absent words that cannot be obtained by concatenating
a substring to another absent word; all proper subsequences of a MAW are present in
the text. MAWs in genomic sequences have been addressed repeatedly [?], as they
obviously form a basis for the derived set of all absent words. Furthermore, while the
number of absent words grows exponentially with their length [?], because new AWs can
be built by adding letters to other AWs, the number of MAWs for genomes shows a
drastically different behavior, as illustrated below in Fig. 1(a) and previously
reported in the literature [?]. The behavior can be summarized as follows: there are
one or more shortest minimal absent words of a length which we will denote $l_0$, a
maximum of the distribution at a length we will denote $l_{\mathrm{mode}}$, and a very
slow decay of the distribution for large lengths. In human $l_0$ is equal to 11, as
first found in [?], and $l_{\mathrm{mode}}$ is equal to 18, there being about 2.25
billion MAWs of that length, while the support of the distribution extends to around
$10^6$ (Fig. 1(c) and (d)). The total number of human MAWs is about eight billion. As
already found in [?], the very end of the distribution depends on the genome assembly;
for human genome assembly GRCh38.p2 the three longest MAWs are 1475836, 831973 and
744232 nt in length.

Several aspects of this distribution are interesting. First, in a four-letter alphabet
there are $4^k$ possible subsequences of length $k$, but in a text of length $L$ at most
$(L - k + 1)$ subsequences of length $k$ actually appear. If the human genome were a
completely random string of letters one would therefore expect the shortest MAW to be
of length 15. The fact that $l_0$ is considerably shorter (11) is therefore already an
indication of a systematic bias, in [22] attributed to the hypermutability of CpG
sites. We will return to this point below. More intriguing is the observation that the
overwhelming majority of the MAWs lie in a smooth distribution around
$l_{\mathrm{mode}}$, and then a small minority are found at longer lengths. We will
call the first part of the distribution the bulk and the second the tail. We separate
the tail from the bulk by a cut-off $l_{\max}$ which we describe below; the human
$l_{\max}$ is 33, a typical number for larger genomes, while the Escherichia coli
$l_{\max}$ is 24. Using this separation there are about 35 million human tail MAWs,
about 0.447% of the total, while there are 7632 E. coli tail MAWs, about 0.053% of the
total. The effect is qualitatively the same for, as far as we are aware, all
eukaryotic, archaeal and bacterial genomes analyzed in the literature [21, 15], as well
as all tested by us. Only a few viruses with short genomes are exceptions to this rule
and show only the bulk, see Fig. 1(c).

The questions we want to answer in this work are why the distributions of MAWs are
described by the bulk and the tail. Can these be understood quantitatively? Do they
carry biological information or are they some kind of sampling effect? Can one make
further observations? We will show that both the bulk and the tail can be described
probabilistically, but in two very different ways. The bulk of the MAW distribution
arises from sampling words from finite random sequences and is contained in an interval
$[l_{\min}, l_{\max}]$, where $l_{\max}$ was introduced above and $l_{\min}$ is a good
predictor of $l_0$, the actual length of the shortest AWs. To the best of our knowledge
this has not been shown previously, and although our analysis uses only elementary
considerations, they have to be combined in a somewhat intricate manner. The
distribution of bulk MAWs, which comprise the vast majority of MAWs in all genomic
sequences, can hence be seen as nothing more than a complicated transformation of
simple statistical properties of the sequence. In fact, excellent results are obtained
taking only the single-nucleotide composition into account (Fig. 1(b)). Nevertheless,
the tail MAWs are different, and can be described by a statistical model of genome
growth by a copy-paste-mutate mechanism similar to the one presented in [23]. We show
that the distributions of the tail MAWs vary, both in the data and in the model. The
human and the mouse MAW tail distributions follow approximately a power law, but this
seems to be more the exception than the rule; bacteria and yeast as well as e.g. Picea
abies (Norway spruce) show a cross-over behavior to a largest MAW length. For bacteria
this largest length ranges from hundreds to thousands; for P. abies it is around
30 000; while for human and mouse the tail MAW distribution reaches up to one million,
without cross-over behavior (Fig. 1(d)).

Fig. 1. Distributions of the lengths of absent and minimal absent words in genomes and
random texts. (a) Number of AWs and MAWs as a function of word length in the genome of
E. faecalis v583. The number of AWs grows exponentially while the distribution of MAWs
shows a maximum and a decay. (b) Comparison between the distribution of MAWs in
E. faecalis and the ones for a random genome of the same size using different random
models. (c) Distributions of MAWs for a few common organisms and viruses. (d) Length of
a MAW as a function of its rank for the distributions shown in (c).

From the definition, any subword obtained by removing letters from the start or end of
a MAW is present in the sequence. In particular, removing the first and last letters
of a MAW leads to a subword that is present at least twice, which we denote here as a
MAW core. MAWs made of a repeat of the same letter are an exception to this rule, as
the two copies of their cores can overlap each other, see Appendix A in Supplemental
Information. MAW cores can be considered as the causes that create the MAWs, and their
location on the genome combined with functional information from the annotation tells
us about their biological significance. Finding all the occurrences on a genome of a
given word (such as a MAW core) is the very common bioinformatic task of alignment,
which can be done quickly and efficiently using one of the many software packages
available. In bacteria and yeast, the cores from the longest MAWs are predominantly
found in regions coding for ribosomal RNAs (rRNAs), regions present in multiple copies
on the genome and under high evolutionary pressure, as their sequence determines their
enzymatic properties, which are required for protein synthesis and vital to every
living cell. At the global scale, it appears that MAW cores obtained from MAWs in the
bulk are distributed roughly uniformly over the genome, while those from the tail
cluster in 3' UTRs and, to a lesser extent, also in 5' UTRs. These regions are
important for post-transcriptional regulation, and thus likely to be under evolutionary
pressure similarly to rRNAs.
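The relation between a MAW and its core can be illustrated on a toy sequence. The sketch below is only an illustration: the helper names `find_all` and `maw_core` and the ten-letter example genome are assumptions made here, exact string search stands in for a real aligner, and the reverse-complement strand is ignored.

```python
def find_all(genome, word):
    """Return the start positions of all (possibly overlapping) exact matches."""
    hits = []
    start = genome.find(word)
    while start != -1:
        hits.append(start)
        # Advance by one position so that overlapping matches are also found.
        start = genome.find(word, start + 1)
    return hits


def maw_core(maw):
    """The core of a MAW: the word without its first and last letters."""
    return maw[1:-1]


# Hypothetical toy genome: "ACG" occurs twice, once preceded by G and once followed by A.
genome = "GACGTTACGA"
maw = "GACGA"  # absent, although "GACG" and "ACGA" are both present
print(maw in genome, find_all(genome, maw_core(maw)))
```

Here the first copy of the core supplies the left context of the MAW and the second copy supplies the right context, which is why the core must occur at least twice.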

We end this Introduction by noting that from a linguistic perspective a language can
be described by its list of forbidden sub-sequences, or the list of its forbidden
words [24].¹ Minimal forbidden words relate to forbidden words as MAWs to absent words,
and in a text of infinite length the lists of MAWs and minimal forbidden words would
agree. If there is a finite list of minimal forbidden words, the resulting language
lies on the lowest level of regular languages in the Chomsky hierarchy [25, 26], and is
hence relatively simple, while a complex set of instructions, such as a genome, is
expected to correspond to a more complex language, with many layers of meaning. Such
aspects have been exploited in cellular automata theory [27, 28] and in dynamical
systems theory [29], and are perhaps relevant to genomics as well. The present
investigation is however focused on properties of texts of finite length, for which
minimal forbidden words and MAWs are quite different.


A random model for the bulk 

Let us consider a random sequence $S$ of total length $N$ with alphabet {A, C, G, T}.
Each position of $S$ is independently assigned the letter A, C, G, or T with
corresponding probabilities $\omega_A$, $\omega_C$, $\omega_G$ and $\omega_T$
$(= 1 - \omega_A - \omega_C - \omega_G)$. A word of length $L$ has the generic form
$w = c_1 c_2 \ldots c_{L-1} c_L$, where $c_i \in \{A, C, G, T\}$ is the letter at the
$i$-th position. The total number of such words is $4^L$. This number exceeds $N$ when
$L$ increases to order $O(\ln N)$, therefore most of the words of length
$L > O(\ln N)$ will never appear in $S$. Then what is the probability $q_w$ of a
particular word $w$ being a MAW of sequence $S$?

¹ In genomics the concept of minimal absent words was first introduced in [?] as
"minimal forbidden words". Since this term has another well-established meaning we
have here instead used MAW [21]. Related concepts are "unwords", which are the shortest
absent words (also shortest minimal absent words), and "nullomers" [13], which are
absent words without a requirement on minimality, compare data in Table 1 in [13].

For $w$ to be a MAW, it must not appear in $S$ but its two subwords of length $(L - 1)$,
$w^{(p)} = c_1 c_2 \ldots c_{L-1}$ and $w^{(s)} = c_2 \ldots c_{L-1} c_L$, must appear
in $S$ at least once, as demonstrated schematically in Fig. 2(a). We define the core of
the MAW $w$ as the substring

$$w^{\mathrm{core}} = c_2 c_3 \ldots c_{L-1},$$

which must appear in $S$ at least twice, except for the special case of
$c_1 = c_2 = \ldots = c_L$ where $w^{(p)}$ and $w^{(s)}$ overlap (see Appendix A in
Supplemental Information). The core must immediately follow the letter $c_1$ at least
once and it must also be immediately followed by the letter $c_L$ at least once.
Similarly, if $w^{\mathrm{core}}$ immediately follows the letter $c_1$, it must not be
immediately followed by the letter $c_L$.
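The definition can be checked directly on very short texts. The following brute-force sketch is meant only to make the conditions concrete (word absent, prefix $c_1 \ldots c_{L-1}$ present, suffix $c_2 \ldots c_L$ present); the function name and the toy text are assumptions of this example, and the approach is far too slow and memory-hungry for real genomes.

```python
def minimal_absent_words(text, max_len, alphabet="ACGT"):
    """Brute-force list of the MAWs of length <= max_len (short texts only)."""
    # Every substring of length 1..max_len that occurs in the text.
    present = {text[i:i + k]
               for k in range(1, max_len + 1)
               for i in range(len(text) - k + 1)}
    # A letter that never occurs is a MAW: it has no non-empty proper subword.
    maws = [a for a in alphabet if a not in present]
    for k in range(2, max_len + 1):
        for prefix in (p for p in present if len(p) == k - 1):
            for last in alphabet:
                word = prefix + last
                # The prefix is present by construction; the word must be
                # absent and its suffix (first letter removed) must be present.
                if word not in present and word[1:] in present:
                    maws.append(word)
    return sorted(maws)


# Toy example over a two-letter alphabet.
print(minimal_absent_words("AACA", 4, alphabet="AC"))
# -> ['AAA', 'CAA', 'CAC', 'CC']
```

In this toy output, "AAA" is an instance of the single-letter-repeat special case mentioned above: its two length-2 subwords are the same occurrence of "AA".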

We can construct $(N - L + 1)$ subsequences of length $L$ from $S$, say
$S_1, S_2, \ldots, S_{N-L+1}$. Neighboring subsequences are not fully independent as
there is an overlap of length $(L - m)$ between $S_n$ and $S_{n+m}$ with
$1 \le m < L$. However, for $L \ll N$ two randomly chosen subsequences of length $L$
from the random sequence $S$ have a high probability of being completely uncorrelated.
We can thus safely neglect these short-range correlations, and consequently the
probability of word $w$ being a MAW is expressed as

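$$
q_w = \left[1 - \omega(w)\right]^{N-L+1}
      - \left[1 - \omega(w^{(p)})\right]^{N-L+1}
      - \left[1 - \omega(w^{(s)})\right]^{N-L+1}
      + \left[1 - \omega(w^{(p)}) - \omega(w^{(s)}) + \omega(w)\right]^{N-L+1},
\tag{1}
$$

where $\omega(w) = \prod_{i=1}^{L} \omega_{c_i}$ (with $c_i$ being the $i$-th letter of
$w$) is the probability of a randomly chosen subsequence of length $L$ from $S$ to be
identical to the word $w$, while $\omega(w^{(p)})$ and $\omega(w^{(s)})$ are,
respectively, the probabilities of a randomly chosen subsequence of length $(L - 1)$
from $S$ being identical to $w^{(p)}$ and $w^{(s)}$.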


Fig. 2. (a) Illustration of the properties of a minimal absent word and its subwords.
For a MAW of the form A[core]T: the word A[core]T is absent; the core is present at
least twice; A[core]N is present at least once with N ≠ T; and N[core]T is present at
least once with N ≠ A. (b) Comparison between the length distribution predicted by
Eq. [2] and the number of MAWs calculated for one instance of a random genome of
3.3 Mbp with uniform nucleotide distribution and with 37% GC content.


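Summing over all the $4^L$ possible words of length $L$, we obtain the expected number
$\Omega(L) = \sum_w q_w$ of MAWs of length $L$ for a random sequence $S$ of length $N$:

$$
\begin{aligned}
\Omega(L) = \sum_{c_1} \sum_{c_L} \sum_{n_A, n_C, n_G, n_T}
  & \frac{(L-2)!}{n_A!\, n_C!\, n_G!\, n_T!}\; \delta_{n_A + n_C + n_G + n_T}^{L-2} \\
  \times \Big\{
  & \left(1 - \omega_{c_1} \omega_{c_L} \omega_A^{n_A} \omega_C^{n_C} \omega_G^{n_G} \omega_T^{n_T}\right)^{N-L+1} \\
  & - \left(1 - \omega_{c_1} \omega_A^{n_A} \omega_C^{n_C} \omega_G^{n_G} \omega_T^{n_T}\right)^{N-L+1} \\
  & - \left(1 - \omega_{c_L} \omega_A^{n_A} \omega_C^{n_C} \omega_G^{n_G} \omega_T^{n_T}\right)^{N-L+1} \\
  & + \left(1 - (\omega_{c_1} + \omega_{c_L} - \omega_{c_1} \omega_{c_L})\, \omega_A^{n_A} \omega_C^{n_C} \omega_G^{n_G} \omega_T^{n_T}\right)^{N-L+1}
  \Big\},
\end{aligned}
\tag{2}
$$

where the summation is over all the 16 combinations of the two terminal letters
$c_1, c_L$ and over all the possibilities with which the letters A, C, G, T may appear
in the core a total number of times equal to, respectively, $n_A$, $n_C$, $n_G$ and
$n_T$.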
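The sum over terminal letters and core compositions can be evaluated numerically. The sketch below is one possible way to do so, not the authors' code: the function name `expected_maws` is an assumption, each core composition is weighted by its multinomial count, and Python's `decimal` module with 80 digits is used because the four terms nearly cancel for long words.

```python
from decimal import Decimal, getcontext
from math import factorial

getcontext().prec = 80  # high precision: the four terms nearly cancel at large L


def expected_maws(L, N, freqs):
    """Expected number of MAWs of length L >= 2 in a random text of length N.

    freqs maps each of 'A', 'C', 'G', 'T' to its probability.
    """
    w = {k: Decimal(str(v)) for k, v in freqs.items()}
    M = N - L + 1   # number of windows of length L
    m = L - 2       # core length
    total = Decimal(0)
    for nA in range(m + 1):
        for nC in range(m - nA + 1):
            for nG in range(m - nA - nC + 1):
                nT = m - nA - nC - nG
                # Number of distinct cores with this letter composition.
                mult = factorial(m) // (factorial(nA) * factorial(nC)
                                        * factorial(nG) * factorial(nT))
                core = w["A"] ** nA * w["C"] ** nC * w["G"] ** nG * w["T"] ** nT
                for c1 in "ACGT":
                    for cL in "ACGT":
                        p = w[c1] * core           # probability of the prefix
                        s = w[cL] * core           # probability of the suffix
                        full = w[c1] * w[cL] * core  # probability of the word
                        total += mult * ((1 - full) ** M - (1 - p) ** M
                                         - (1 - s) ** M + (1 - p - s + full) ** M)
    return float(total)


# Example: a 3.3 Mbp random text with 37% GC content.
gc37 = {"A": 0.315, "C": 0.185, "G": 0.185, "T": 0.315}
print(expected_maws(10, 3_300_000, gc37))
```

The number of compositions grows only polynomially with $L$, so the triple loop stays cheap for the word lengths relevant to the bulk.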

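In the simplest case of maximally random sequences, namely
$\omega_A = \omega_C = \omega_G = \omega_T = 1/4$, Eq. [2] reduces to

$$
\Omega(L) = 4^L \left[ \left(1 - 4^{-L}\right)^{N-L+1}
  - 2 \left(1 - 4^{-(L-1)}\right)^{N-L+1}
  + \left(1 - 2 \cdot 4^{-(L-1)} + 4^{-L}\right)^{N-L+1} \right].
\tag{3}
$$

We have checked by numerical simulations (see Fig. 2) that Eq. [2] and Eq. [3] indeed
give excellent predictions of the number of MAWs as a function of their length in
random sequences.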
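For equiprobable letters every word of length $L$ has probability $4^{-L}$ and every word of length $L - 1$ has probability $4^{-(L-1)}$, so the expected count follows from inclusion-exclusion over the events "prefix absent" and "suffix absent". The sketch below evaluates this closed form; the function name and the genome size of 3.3 Mbp (taken from the caption of Fig. 2) are choices made for this illustration, and `decimal` arithmetic is used to avoid cancellation errors.

```python
from decimal import Decimal, getcontext

getcontext().prec = 80  # the three terms cancel to high order for long words


def expected_maws_uniform(L, N):
    """Expected number of MAWs of length L in a uniform random text of length N."""
    M = N - L + 1
    a = Decimal(4) ** (-L)        # probability of a given word of length L
    b = Decimal(4) ** (-(L - 1))  # probability of a given word of length L - 1
    bracket = (1 - a) ** M - 2 * (1 - b) ** M + (1 - 2 * b + a) ** M
    return float(Decimal(4) ** L * bracket)


N = 3_300_000
profile = {L: expected_maws_uniform(L, N) for L in range(2, 41)}
mode = max(profile, key=profile.get)
print(mode, profile[mode])
```

Expanding the bracket to second order shows that for long words the count behaves as $9 N^2 4^{-L}$, which is the origin of the $\ln 9$ correction in the estimate of the upper end of the bulk.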

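We define a predicted minimum and a predicted maximum of the support of the bulk
($l_{\min}$ and $l_{\max}$) as the two values of $L$ such that $\Omega(L) = 1$. In the
general case, requiring that $\Omega(L) \ge 1$ we obtain $L \ge l_{\min}$, with the
shortest length $l_{\min}$ such that

$$
\sum_{c_1} \sum_{c_L} \sum_{n_A, n_C, n_G, n_T}
  \frac{(l_{\min}-2)!}{n_A!\, n_C!\, n_G!\, n_T!}\;
  \delta_{n_A + n_C + n_G + n_T}^{l_{\min}-2}
  \times \left(1 - \omega_{c_1} \omega_{c_L} \omega_A^{n_A} \omega_C^{n_C} \omega_G^{n_G} \omega_T^{n_T}\right)^{N - l_{\min} + 1}
\tag{4}
$$

is closest to one, while in the other limit we obtain that $L \le l_{\max}$, with the
longest length $l_{\max}$ being

$$
l_{\max} = \frac{2 \ln N}{-\ln\left(\omega_A^2 + \omega_C^2 + \omega_G^2 + \omega_T^2\right)}.
\tag{5}
$$

The bulk distribution is therefore centered around lengths of order $\log N$. In the
case of maximally random sequences, we can obtain the lower limit analytically and also
the first correction to $l_{\max}$, as

$$
l_{\min} \approx \frac{\ln N - \ln \ln N}{\ln 4}
\tag{6}
$$

and

$$
l_{\max} \approx \frac{2 \ln N + \ln 9}{\ln 4}.
\tag{7}
$$

The above definition of $l_{\max}$ is good enough for our purposes, and $l_{\min}$ is
also a good predictor for $l_0$ (see below). A more refined predictor for $l_{\min}$ is
discussed in Appendix B in the Supplemental Information.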
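The closed-form bounds are simple enough to evaluate directly. The sketch below assumes the readings $l_{\min} \approx (\ln N - \ln\ln N)/\ln 4$ and $l_{\max} \approx (2\ln N + \ln 9)/\ln 4$ for equiprobable letters, and $l_{\max} = 2\ln N / (-\ln \sum_x \omega_x^2)$ for a general composition; the function names and the example genome size are illustrative choices, not values from the paper.

```python
import math


def l_min_uniform(N):
    """Predicted lower end of the bulk for equiprobable letters."""
    return (math.log(N) - math.log(math.log(N))) / math.log(4)


def l_max_uniform(N):
    """Predicted upper end of the bulk for equiprobable letters, with the ln 9 correction."""
    return (2 * math.log(N) + math.log(9)) / math.log(4)


def l_max_composition(N, freqs):
    """Leading-order upper end of the bulk for given single-letter frequencies."""
    return 2 * math.log(N) / -math.log(sum(f * f for f in freqs))


N = 5_000_000  # hypothetical genome size, for illustration only
print(l_min_uniform(N), l_max_uniform(N))
print(l_max_composition(N, [0.315, 0.185, 0.185, 0.315]))
```

A biased composition increases $\sum_x \omega_x^2$ above 1/4, so the predicted upper end of the bulk moves to longer words than in the equiprobable case.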


A random model for the tail 

We now describe a protocol for constructing random genomes by an iterative
copy-paste-mutation scheme that qualitatively reproduces the tail behavior observed
for most real genomes. The model is in principle similar to [23] but differs in the
details of the implementation.

E. Aurell, N. Innocenti, H.-J. Zhou | September 18, 2015

The starting point is a string of nucleotides chosen independently at random with a
length $N_0$. At each iteration, we choose two positions $i$ and $j$ uniformly at
random on the genome and a length $l$ from a Poisson distribution with mean $\lambda$.
We copy the sequence between $i$ and $(i + l - 1)$ and insert it between positions $j$
and $j + 1$, thus increasing the genome size by $l$. We then randomly alter a fraction
$\alpha$ of nucleotides in the genome, choosing the positions uniformly at random and
the new letters from an arbitrary distribution that can be tuned to adjust the GC
content. The process is repeated until the genome reaches the desired length.
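The protocol above can be sketched in a few lines. This is a minimal reading of the description rather than the authors' implementation: the function and parameter names, the default values, the stochastic rounding of the number of mutated positions, and the use of Knuth's multiplicative method for Poisson sampling are all assumptions made for this example.

```python
import math
import random


def poisson(rng, mean):
    """Knuth's multiplicative Poisson sampler (fine for means up to a few hundred)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def copy_paste_mutate(final_len, n0=50, lam=20, alpha=0.001,
                      weights=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Grow a random genome by repeated copy-paste events followed by mutations."""
    rng = random.Random(seed)
    letters = "ACGT"
    genome = rng.choices(letters, weights=weights, k=n0)
    while len(genome) < final_len:
        length = poisson(rng, lam)
        if length == 0 or length > len(genome):
            continue  # nothing to copy, or segment longer than the genome
        i = rng.randrange(len(genome) - length + 1)  # start of the copied segment
        j = rng.randrange(len(genome) + 1)           # insertion point
        genome[j:j] = genome[i:i + length]
        # Mutate a fraction alpha of all positions (stochastic rounding).
        expected = alpha * len(genome)
        n_mut = int(expected) + (rng.random() < expected - int(expected))
        for pos in rng.sample(range(len(genome)), n_mut):
            genome[pos] = rng.choices(letters, weights=weights)[0]
    return "".join(genome)


g = copy_paste_mutate(2000, seed=1)
print(len(g), g[:60])
```

A redrawn letter may coincide with the old one, so the effective substitution rate is somewhat below `alpha`; the final genome may also slightly overshoot `final_len`, since the last pasted segment is not truncated.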

In this model, $\lambda$ represents the typical size of a region involved in a
translocation in the genome and $\alpha$ corresponds to the expected number of
mutations between such events. We observed that the length $N_0$ or the content of the
initial string is unimportant provided that it is much shorter than the final genome
size. The exact value of $\lambda$ given a constant $\alpha/\lambda$ ratio only affects
the tail of the distribution far away from the bulk. We have also checked by
simulations that using different distributions for the choice of $l$ does not affect
the results qualitatively (see Fig. S1 in Supplemental Information).

A low ratio $\alpha/\lambda$ generates genomes with tail MAWs while higher values cause
them to only have bulk MAWs, as in the random texts discussed above (Fig. 3). This is in
agreement with the observations for viruses in Fig. 1(c). Hepatitis B and H5N1 are
viruses that replicate using an error-prone reverse transcription, and the MAW
distributions for their genomes lack the tail. In contrast, Human Herpes virus 5 and
the Cafeteria roenbergensis virus are DNA viruses that use the higher-fidelity DNA
replication mechanism, and their genomes clearly have tail MAWs.



Fig. 3. Length distributions for MAWs in a few random genomes 3 million bp in size
generated by the copy-paste-mutation protocol with different values for the
$\alpha/\lambda$ ratio. The curves were obtained using $N_0 = 5000$ and a constant
$\lambda = 500$.


Estimating the length of the shortest absent words 

Equations [4] and [6] can be used to estimate the length of the shortest absent words.
Fig. 4 compares the prediction of the simplest estimate in Eq. [6] to the length of the
shortest MAW for a large set of genomes.

The estimator Eq. [6] is expected to be most accurate only for genomes with neutral GC
content. The figure reveals that genomes of comparable sizes typically vary in their
$l_0$ by about 4 nt, and that our estimator captures very well the upper values in this
distribution. Using Eq. [4] only improves the predictions for genomes with strongly
biased GC content (40% or less) and leads to results in line with the earlier published
estimator by Wu et al. [?], see Appendix C and Table S1 in Supplemental Information.

This analysis shows that, contrary to the conclusion of [22], there is no need to invoke
a biological mechanism to explain the length of the shortest MAW; it is instead a
property of rare events when sampling from a random distribution. Indeed, our estimator
gives the length of the shortest human MAW as 12 and not 15, only one nucleotide away
from the correct answer (11).



Fig. 4. Length of the shortest MAW versus the genome size for all viral, bacterial and
archaeal genomes available on the NCBI database as well as a few arbitrarily chosen
eukaryotes, including many short genomes and only a few complex organisms, such as
human, mouse and Norway spruce. The black line represents the estimator Eq. [6].

The origins of tail MAWs 

The Human Herpes virus 5, a double-stranded DNA virus with a linear genome of
~235.6 kbp, has four very long MAWs with lengths of 2540 nt for two of them, and 1360 nt
for the two others, all other MAWs being much shorter (81 nt or less). The cores of
these four MAWs come from three regions, two of them located at the very beginning and
very end of the genome, and the third at position ~195 kbp made up of the juxtaposition
of the reverse complements of the two others. These regions are annotated as repeated
and regulatory. Based on the NCBI BLAST web service, these particular sequences are
highly conserved (95% or more) in numerous strains of the virus and do not seem to have
homologues in any other species: the closely related Human Herpes virus 2 shows
sequences with no more than 42% similarity to these MAW cores.

In E. coli and E. faecalis, the 10 longest MAWs, with lengths between 2815 and 3029 nt,
all originate from rRNA regions: a set of genes present in a few copies made of highly
conserved regions with minor variations between the copies. For yeast, the four longest
MAWs (8376 nt) originate from the two copies of rRNA RDN37 on chromosome XII, another
four (7620 nt) are caused by 2 copies of the region containing PAU1 to VTH2 on
chromosome X and PAU4 to VTH2 on chromosome IX, and two more (6531 nt) originate from
the copies of gene YRF on chromosomes VII and XVI (YRF is present in at least 8 copies
on the yeast genome).

We performed an extensive search for all occurrences of the cores of every MAW found in
the organisms mentioned above and considered the density of MAW cores along the genome
for MAWs in the tail (see Fig. S2 in Supplemental Information). Except for the Human
Herpes virus 5, which only has few MAWs and whose MAWs cluster in the repeated segments
discussed above, MAW cores do not appear to be preferentially located in any specific
regions on the genome scale. A more detailed analysis (Fig. 5), however, reveals that,
while cores of MAWs from the bulk appear uniformly distributed, those from the tail
cluster downstream of ends of annotated coding DNA sequences (CDS) (i.e. in the 3' UTRs
and terminator sequences). A similar yet weaker effect can be observed upstream of the
start of annotated CDS (the 5' UTR). By definition, a MAW core corresponds to a repeated
region on the genome immediately surrounded by nucleotides varying between the copies.
Exact repeated regions lead to only a few MAWs with cores corresponding to that repeat.
Introducing a few random changes in such regions creates more but shorter MAWs, the
cores of which are the sub-strings common to two or more regions. A high density of MAW
cores in a family of regions such as the UTRs thus indicates that they share a limited
set of building blocks, implying a similar set of evolutionary constraints or a common
origin.

Fig. 5. Average number of MAW cores for MAWs from the bulk and the tail around (a)-(c)
ends (3' UTRs) and (d)-(f) starts (5' UTRs) of annotated CDS for three living organisms
(see also Fig. S2 in Supplemental Information). The signals are normalized so that their
mean value over the window of interest is 1. The MAW cores concentrate in the UTRs, more
strongly on the 3' side than on the 5' side.


The significance of the longest MAWs 

We now consider the lengths of the longest MAW found in the genomes of numerous
organisms and viruses. We observe that this length generally lies between $l_{\max}$
and a length of about 10% of the genome size (Fig. 6).

Viruses are the class showing the largest spread in the length of their longest MAWs.
Many viruses are close to $l_{\max}$, particularly those with shorter genomes,
confirming our previously mentioned observation that some lack the tail and are thus
closer to random texts. Nevertheless, a few viral genomes have MAWs longer than 10% of
the genome length, i.e. proportionally longer than in any living organism. The figure
suggests that bacteria have on average slightly longer MAWs than archaea, but overall no
clear distinctions between the four types of genomes can be noted based on the length of
MAWs alone, suggesting that the mechanisms behind evolution of all organisms and viruses
influence the MAW distribution in the same way.

A more detailed analysis of the data presented in Fig. 6 shows that organisms having
their longest MAWs closest to the lower bound of the tail ($l_{\max}$), or closest to
the observed upper bound of 10% of the genome length, share some common traits.

In bacteria, the 6 genomes having their longest MAWs closest to $l_{\max}$ are two
strains from the Buchnera aphidicola species, two strains of Candidatus Carsonella
ruddii, one Candidatus Phytoplasma solani and one Bacteroides uniformis. While the last
is a putative bacterial species living in human feces [30], the five other species are
intracellular symbiotic or parasitic gammaproteobacteria in plants or insects
[31, 32, 33].


Among eukaryotes, the same analysis gives us Plasmodium gaboni, an agent responsible for
malaria [33]; a species of Cryptosporidium, another family of intracellular parasites
found in drinking water; and Chromera velia, a photosynthetic organism from the same
apicomplexa phylum as Plasmodium, which is remarkable in this class for its ability to
survive outside a host and is of particular interest for studying the origin of
photosynthesis in eukaryotes [?]. For archaea, we find Candidatus Parvarchaeum
acidiphilum and C. P. acidophilus, which are two organisms with short genomes (45.3 and
100 kbp) living in low-pH drainage water from the Richmond Mine in Northern California
[32], and an uncultivated hyperthermophilic archaeon "SCGC AAA471-E16" of the
Aigarchaeota phylum [38]. Additionally, we searched for MAWs in 2395 human mitochondrial
genomes with lengths between 15436 and 16579 bp and found that the longest MAWs are only
17 or 18 nt long, while $l_{\max} \approx 16.5$ for these genomes.

Among bacteria having their longest MAW close to 10% of the genome length, we find
several strains of E. coli, Francisella tularensis, Shewanella baltica, Methylobacillus
flagellatus, Xanthomonas oryzae and a species of the Wolbachia genus. All of these are
facultative or obligatory aerobes and are a lot more widespread than the bacteria listed
in the previous paragraph. At least the first four species are free-living and commonly
cultured in labs. X. oryzae is a pathogen affecting rice, residing in the intercellular
spaces. Wolbachia species are very common intracellular parasites or symbionts living in
arthropods. Among eukaryotes, in addition to human and mouse, we find Dictyostelium
discoideum, an organism living in soil that changes from uni- to multi-cellular during
its life cycle, and Thalassiosira pseudonana, a unicellular alga commonly found in
marine phytoplankton. Finally, archaea with the longest MAWs are Methanococcus voltae, a
mesophilic methanogen, Halobacterium salinarum, a halophilic marine obligate aerobic
archaeon also found in food such as salted pork or sausages, and Halalkalicoccus
jeotgali, another halophilic archaeon isolated from salted sea-food [39].

Fig. 6. Length of the longest absent words as a function of the genome sizes for the
same set of genomes as in Fig. 4 (horizontal axis: length of longest MAW, in nt). The
dashed line represents the estimator of $l_{\max}$ and the dotted line $y = 0.1x$.

To summarize, it appears from these results that development in specific environmental
niches that offer rather stable conditions, and particularly the inside of the cell of
another organism, is, albeit with some exceptions, associated with short MAWs. On the
other hand, aerobic life and widespread presence in changing and diverse environments
such as soil, sea water or food is generally associated with long MAWs.

Intracellular organisms have a well-known tendency to reduce the size of their genomes
and to increase the error rate of DNA replication due to elimination of error correction
mechanisms [?]. This translates into a high value for $\alpha$ in our random model for
the tail and explains why their longest MAWs are short. In particular, we note that
B. aphidicola (genome size of ~650 kbp), brought out by our analysis, is known to have
the highest mutation rate among all prokaryotes, and indeed its longest MAW is as short
as 34 nt [40, 41]. As for organisms specialized for a niche environment, one may
hypothesize that proliferation speed is more important than replication fidelity [?].

The reasons why a widespread presence in changing environments should increase the
length of the longest MAWs are less clear. Multicellular eukaryotic organisms do not
show particularly high DNA replication fidelity [41]. The fidelity of E. coli is fairly
good, at about 3.5 times that of the human germline [?], but it is unclear if this is
enough to make it stand out from other bacteria. We speculate that soil, sea water or
food, which are likely to contain many types of microorganisms, may favor species more
likely to undergo horizontal gene transfers, thus increasing the rate of translocation
events and decreasing $\alpha$ in our model, while leaving the per-generation mutation
rate unchanged.

Conclusions 

We have proposed a two-part model explaining the unusual shape of the length
distribution of minimal absent words (MAWs) in genomes. The first part of the model
quantitatively reproduces the bulk of the distribution by considering the genome as a
random text with random and independent letters. The second part is a stochastic
algorithm grounded in basic principles of how genomes evolve, through translocation
events and mutations, that qualitatively reproduces the behavior in the tail. Our theory
provides an estimator for the length of the shortest MAWs that is remarkably simple and
captures well the global trend observed in large numbers of genomes from all sorts of
organisms and viruses.

Considerations about the longest MAW in a genome reveal sets of organisms sharing common
high-level features such as the type of environment they live in. We have presented
arguments for believing that organisms and viruses having few tail MAWs do so because of
a low replication fidelity. Why some organisms such as E. coli have long tail MAWs is
less clear, and replication fidelity alone does not seem to be a sufficient reason.

Finally, we have introduced the concept of MAW cores, sequences present on the genome
that tell us what causes the existence of their parent MAW. We have shown that, while
cores from bulk MAWs do not seem biologically relevant, cores from tail MAWs cluster in
regions of the genome with special roles, namely ribosomal RNAs and untranslated regions
surrounding coding regions of genes, a feature that cannot be explained by a stochastic
protocol that ignores the biological roles of the strings it manipulates.

Materials and Methods 


Data source. Viral and bacterial genomes were downloaded in the form of the "all.fna"
archives from "Genomes/Viruses/" and "Genomes/Bacteria" in the NCBI database on 17 and
18 May 2015, respectively. The Norway spruce genome was downloaded from the "Spruce
genome project" homepage and the yeast genome strain 288C [43] from the Saccharomyces
Genome Database. Genomes of other eukaryotes and archaea were downloaded from the NCBI
database at several different dates over the period May-June 2015. The human
mitochondrial genomes were downloaded from the Human Mitochondrial Genome Database
(mtDB) in early September 2015.

Software & Computational Resources. All MAWs were computed using the software provided
by Pinho et al. in [?]. The software was run taking into account the
reverse-complementary strand ('-rc' command line switch) and requesting MAWs ('-n'
command line option) with length up to five million nucleotides, i.e. much longer than
the expected length of the longest MAWs. The search for MAWs was performed on commodity
desktop computers for all but the human, mouse, and Norway spruce genomes, for which the
computer "Ellen" from the Center for High Performance Computing (PDC) at KTH was used.
Localization of occurrences of MAW cores on the genomes for Figures 5 and S2 was done by
aligning these subwords to their respective genomes using Bowtie2 [45], allowing only
strict alignments (command line option '-v 0'). Identical cores from different MAWs are
counted independently in the coverage.
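The coverage counting and the normalization used for the profiles around CDS boundaries can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions, not the pipeline actually used: hits are given as hypothetical (start, length) pairs, strand orientation is ignored, and windows that run past the genome ends are skipped.

```python
def coverage(genome_len, hits):
    """Per-base number of core hits; identical cores are counted independently."""
    diff = [0] * (genome_len + 1)
    for start, length in hits:
        diff[start] += 1
        diff[min(start + length, genome_len)] -= 1
    cov, running = [], 0
    for d in diff[:genome_len]:
        running += d
        cov.append(running)
    return cov


def mean_profile(cov, anchors, half_window):
    """Average coverage around the anchors, normalized to a mean value of 1."""
    width = 2 * half_window + 1
    profile, used = [0.0] * width, 0
    for a in anchors:
        if a - half_window < 0 or a + half_window >= len(cov):
            continue  # skip windows that do not fit on the genome
        for k in range(width):
            profile[k] += cov[a - half_window + k]
        used += 1
    if used == 0:
        return profile
    profile = [p / used for p in profile]
    mean = sum(profile) / width
    return [p / mean for p in profile] if mean > 0 else profile


# Toy example: two identical hits at position 2 and one hit at position 4.
cov = coverage(10, [(2, 3), (2, 3), (4, 2)])
print(cov)
print(mean_profile(cov, anchors=[4], half_window=2))
```

The difference-array construction keeps the cost linear in the genome length plus the number of hits, which matters when millions of bulk cores are aligned.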